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THE ACCEPTABILITY OF REGRESSION SOLUTIONS: 
ANOTHER LOOK AT COMPUTATIONAL ACCUMCY 

Abstract 

Longley proposed a set of data for use in testing regression pro- 
grajns. This paper shows that the numerically accurate solution is likely 
to be an unreasonable estimate of the regression coefficients for this 
problem. This is true because the accuracy of the data and appropriate- 
ness of the model may affect the solution more than the computational 
method. An easily computed index is derived that can be used to indi- 
cate such computational instability. The basic conclusion is that a 
concern about highly accurate computational methods must be tempered 
with a concern for whether, the data are accurate enough to make the 
results of such computation meaningful. 



THE ACCEPTABILITY OF REGRESSION SOLUTIONS: 
ANOTHER LOOK AT COMPUTATIONAL ACCURACY 



V 



Albert E. Beaton, Donald B. Rubin and John Barone 
Educational Testing Service 

1. Introduction 

Multiple regression is an extremely popular and powerfvJ. method of 

jf 

data analysis. Recently, there has been increasirig ccnc^n about the 
numerical accuracy of common computer programs now available and in use. 
The paper of Longley (1967) is perhaps the most startling paper on this sub- 
ject in the recent statistical literature. Longley took v/hat seems to ^e a 
reasonable set of economic data and performed a six-'^ ariable multiple regres- 
sion analysis using several different programs on several different computers. 
He found that different regression programs resulte-i in very different solutions 
including differences in sign and first significant digit. This finding 
seems to indicate that one should be very careftil about the program and 
machine he uses. 

V/e feel that the computer program is often not the most important factor 
in computing a regression analysis, and that the best thing a program can do 
for some problems is to refuse to complete the-, calculations. Numerical 
experiments in this paoer will show that the computationally accurate solu- 
tion to this regression problem--even whe^"! computed using hO decimal digits 
cf accuracy--may be a very poor estimate of regression coefficients in the 
following sense: small e. rors beyo^'id the last decimal place in the data can 
result in solutions more different than those computed by Longley with his 
less preferred programs. The computationally accurate solution is shown to 
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be nowhere near the center of the iistribution of a large number of presxamably 
equally possible solutions* 

The solution to a regression problem is affected by the data, the 
statistical model, and the program. This paT)er will explore the actual 
accuracy of the data used and show in what sense they are inadequate for 
the solution of this model. A reduced statistical model will be fit \inder 
which the resxilts are not serio\isly affected by small errors in the data 
or the particular programming algorithm. Various algorithms are shown to- 
be sufficiently accurate for most practical purposes if a regression model 
has a reasonably stable solution. 

We then show how knowledge of the error variance in the independent 
variables can be used to compute a simple "perturbation index" which indi- 
cates the stability of the computed solution over the range of possible 
true data sets. 

2. The Lonf,ley Problem 

At first glance the Longley problem seems very much like a typical 
multiple regression analysis of a time series In whicH one "dependent" 
variable Y is regressed on six "independent" variables. The variables are 
Y Total Derived Employment (in thousands) 

XI Gross National Product Implicit Price Deflator (in tenths) 

X2 Gross National Product (gNP) (in millions) 

X5 Unemployment (in thousands) 

Xh Size of Armed Forces (in thousands) 

X5 Noninstitutional Population ih Years of Age and Over (in 
thousands) 

X6 Year 



Longley also presented seven components of Total Derived Employment which 
are discussed in Section Longley fit the following regression model 

i ~ 1^2^«««^ 16 • 

The Longley analysis x-ras computed on the data available for the I6 years fi^om 
19^7 to 1962. The basic data are shovm in Table 1 along with the means, 

Insert Table 1 about here 

standard deviations, ratios of the means to standard deviations, and inter- 
correlations . 

The means and standard deviations do not seem to indicate any particular 
difficulty for analysis. A careful research person might try to improve 
computational accuracy through standardizing each variable by subtracting 
its mean and dividing by its standard deviation, thus converting the raw data 
matrix to a matrix of standard scores (Golub, I969)* The ratio of the mean 
to the standard deviation is an indicator of the sort of computational problem 
discussed by Neely (I966); although the ratios here are not zero, as would be 
the case with standard scores, the ratios, except for X6, do not seem unduly 
large. The high intercorrelations anong the independent \ariables portend a 
conditioning problem*^ since no fewer than five of the 15 unique off-diagonal 
correlations are greater than .99 and a sixth is nearly .98. We feel that 
most statisticians would advise a client not to fit a model with such high 
intercorrelations. Nevertheless, Table 2 gives the usual regression .solution 
for this problem produced by the DLSSQ program discussed in detail later. 



Insert Table 2 about here 



Longley fit the model with these data by a high-precision desk calculator 
method and by a number of different regression programs on different machines. 
The results of some of the programs used by Longley and tvo programs used in 
this paper are shown in Table 3. 

Insert Table 5 about here ^ 

The first two lines are the solution of the regression model by the 
high precision desk calculator method and by an IBM lUOl program that per- 
forms calculations using iiO decimal-digit accuracy. We have inserted the 
calculated solution of two programs (DORTHO and DLSSQ) which were written for 
the experiments performed in this pape . All four solutions agree to at least 
seven decimal places and thus may be considered identical for most practical 
purposes. This vector of regression coefficients will be referred to as the 
"unperturbed" solution without regard to the method of calculation. 

remaining part of this table has the solutions computed by Longley 
with eight programs on four different machines. The variations are striking. 
Some programs gen-irate regression coefficients different in sign and in most 
significant digit from the unperturbed solution. The results of the ORTHO pro- 
gram are closest to the unperturbed solution. The ORTHO algorithm has been pub- 
lished by Walsh (1962) and used in the OMNITAB program (Hilsenrath et al. , I966) 
of the National Bureau of Standards. 

3* Is the Unperturbed Solution a Good Solution for This Sample? 

Although calculation to very high precision is satisfying, we wish to 
explore the unperturbed solution further. We do not question that the unperturbed 
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solution is a possible solution to this regression problem, but there are a 
large number of other soliitions, each in a sense as likely to be the correct 
solution as any other or as the unperturbed solution. 

Taking an extremely conservative position, we cannot avoid the likeli- 
hood that the 19k'J value of variable XI, GNP Implicit Price Deflator, v/as not 
exactly 85.0 but some number between 82,5 and 85.^99..; that the Gross 
National Product is not orecisely 254,289,000,000, but some number between 
234,288,500,000.00 and 254,289,499,999,99. All of the variables, XI through 
X5, are subject to this type of deviation. Even X6, Year, is^not an exact 
variable, although it may have a smaller error than the other independent 
variables. For our purposes here, we will ignore errors in the dependent 
variable. Total Derived Employment, even though that measure is clearly not 
exact either, V/e presume, then, that the data in Table 1 are absolutely 
accurate as far as they go, but do not go as far as possible. 

The error introduced by such rounding would seem to be trivial since 

the data are presented with three to six digits of accuracy. To investigate 

this assumption, we have performed a numerical experiment by taking a rand^*"' 

sample of possible exact values to see if these perturbed data sets v;ould 

result in a solution similar oo the mperturbed solution. A sample of 1,000 

plausible sets of six independent variables was generated by adding a 

2 

rectangularly distributed random nimfiber betv/een -.5 and +.499... in the 
digit after the last published digit. All data sets would be exactly the 
same as the publishea set if rounded to the published number of digits, and 
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in this sense each of these data sets is as likely to Ire the exact data as 
any other or as the published data. 

We next computed a tho\isand regression analyses using these perturbed 

5 

values and tne DORTHO subroutine. The results are shown in Table h. The 

Insert Table h about here 

results of this experiment are very striking indeed. Looking first at the 
highest and lowest values of the regression coefficients, these miniscule 
variations in the values of the independent variables have resulted in 

r 

changes in the computed regression coefficients from -252.2792 to 257«0^67 
for Bl, and equivalently elsewhere. There are differences in sign and 
magnitude for all regression coefficients. 

One might hope that nearl/ all possible solutions wbuJ.d agree with the 
unperturbed solution to at least one significant digit. Not so. For all vari- 
ables except XU, the unperturbed regression coefficients agreed to a single 
significant digit with a perturbed solution in about 2% of the cases; agreed 
to at least one place in about 95^ or the cases. Not one of the 1,000 sets of 

estimated regression coefficients agreed with the unperturbed to one decimal 
place in all seven coefficients. 

Perhaps the unperturbed solution is at least near the center of the thousand 
perturbed solutions. But no. The mean and median of the thousand solutions 
are sho\*n in Table ^. For and , the mean and median differ in sign 

from the unperturbed solution; only fax Bi do the unperturbed solution and the 
mean agree to one significant digit. Assuming that the unperturbed solution is 
the true mean of all possible samples and that the sample means are normally 
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distributed about the \inperturbed solution, -ohen we can apply the standard stu- 
dent t -test to the hypotheses that the \inperturbed values are the true popula- 
tion means. The t -statistics are also shown. The hypothesis that the unper- 
turbed solution is' the average of all solutions for these equally likely data 
sets is entirely implausible with the absolute valu^js of the t -statistics rang- 
ing ''froF.' about 22 xo ll6. 

If the number of observations (N = l6) were large, we would expect the 
average of these thousand solutions to be as indicated in the rov; labeled 
P-lim in Table 4a. The reason will be discussed later . For now, notice that 
the average is much closer to P-lim than to the 'impei turbed and that P-lim 
IS not at all close to the vmperturbed. ^ 

These results are shown graphically in Figure 1 vrhich depicts histograms 

m 

Insert Figure 1 about hefe 

of each regression .coefficient including the intercept, . Each histogram 

is centered at the mean of the perturbed solutions and includes the range 
from three standard deviations below to three standard deviations above the 
mean. The vertical line vn.th an encircled U represents the unperturbed 
solution; the abscissa has been extended to include this point wherever 
necessary. The line with an encircled P represents the P-lim value. 

The effect of these -nertui'bations on the squared multiple correlation 
is shown in Table Ub. All R 's are high, but the imperturbed R is not 

near the center of the distribution. It is in fact 65 standard errors 

2 

away from^the mean R . The estimated values of Y for the unperturbed 
solution and the mean of a thousand perturbed solutions are shovm in Table 
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^ he. In all cases the predictions^ a^ree to several places, but it is not true 
that the unperturbed estimates are near the average of the perturbed estimates. 
The absolute values of the t -statistics forjbhe differences betv/een the 
unperturbed and average perturbed solution range from 3,5 to 82. 

We conclude, therefore, that it is extreraely \inlikely that the unperturbed 
solution is the "correct" solution of this problem. r^Assuming uniform ro\ind- 
ing error in the independent variables, it is highly likely that tv^ of the 
tinperturbed coefficients are incorrect in sign and all but one are not correct 

to one significant digit. The unperturbed multiple correlation and estimated 
veilues, although close enough to the average perturbed values for most 
practical purposes, are nevertheless significantly different. 

The unperturbed solution is, therefore, in this sense totally imsatisfactory. 
Since regression analysis is a combination of model, data and algorithm, we 
will now explore these individually to see what effect each has had in the 
analysis. 

k. The Data 

Since a reijreosion analysis can be sensitive to small perturbations of 

I 

the data, we think In v;orthwhile to make some comments about the actual 
accuracy of tlias Longley data. In fact, the error is often many orders of 
magnitude larger than the error v/e have introduced. In general, data are 
subject to errors In sampling, measirrement, ca.lculation, copying, and to a 
host of other factors and, inconsistencies. A thorough analysis of the error in 
these data ij far beyond the scope of this paper, but we will point out a 
few salient facts about each variable. This discussion may seem to be 
focusing on trivia, but we have seen that smaller errors mey have enormous 
effects. 



The Longley data come from four government reports : 

1. The United States Department of Labor "Manpower Report 
of the President and a Report on Manpov/er Requirements, 
Resources, Utilization, and Training," March I965, 
Table ihl. (USDLA) 

2. The United States Department of Labor "Employment and 
Earn-ings," Vol. 10, #5, September I965, Tables A-1 and 
B-1. (USDLB) --^ 

5. The United States Department of Commerce, Office of 
lousiness Jlconomics "Survey Current Business," July 
1965, p. 12. (USDC) 

Council of Economic Advisors, Economic Report of the 
President, January 196^ ^ Table C-6, p. 21^. (CEA) 
USDLA is dram from the Department of Labor Monthly Labor Force Survey. 

Presxjunably, this is the same monthly survey described in USDLB^which is 

described next. 

The USDLB data are dravm from several sources; the prime resources are: 

1. A mon-Ghly labor force survey sampling of 5^^000 house- 
holds in y7 areas in the country. 

2. Payroll employment statistics supplied monthly by a sample 

of industrial, commercia*., and goverrmient establishments 
employing collectively about 25 million v;orkers. 

3. Unemployment contribution reports filed by employers sub- 
ject to state unemployment insurance lav;s . These are 
considered bench-mark data and cover about 75^ of the 
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total nonfarm employees. Other bench-mark data are 
collected from various government agencies. 
Labor turnover statistics, supplied by a sample of manu- 
facturing, mining, and communication industries. 
The USDC bench-mark data come primarily from census data. Also, the 
Bureau of Census provides an annual sample survey of manufacturers which 
are used as a basis for estimating a number of GNP statistics. Comprehen- 
sive annual reports of government agencies and annual data from private 
sources are also used. 

CEA was used for the estimated implicit price deflators which are based 
primarily on daT:a from the Consumer Price Index of the Bureau of Labor 
Statistics. These data are supplemented by information drawn from the 
Agricultural Marketing Service, Department of Commerce, Interstate Commerce 
Commission, and various other government statistics. 

Clearly, data from such sources luay be subject to errors of many different 
types. Let us look at some char';cteristics of individual variables. ^ 

Y. Total Derived Qnployment (in thousands) . The dependent variable is 
the sum of the estimates of 

Yl. Agricultural Employment (from USDLB) 
Y2. Self-Qnployment (from USDLA) 
Y5. Unpaid Family Workers (from USDLA) 
Y^. Domestic Workers (from USDLA) 

Y5. Nonagri cultural Private Workers (this was computed by 
subtracting the total government workers from the 
total nonagri cultural) (from USDLB) 

y6. Federal Workers (from USDLB) 

Y7. State and Local Government Workers (from USDLB) 
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Both data sources warn that there are several periods of none ompar ability 
over time in these components because of the i?itroduction of data from the 
1950 census and the admission to statehood of Alaska and Havraii, Total 
employment figures were increased by 350,000 and 300,000 respectively. 
There is also a systematic difference in the interpretation of figures 
since variables Y2, Y3, and Xh have not been adjvisted for a change in the 
definitions of emplcyment and unemployment adopted in 1957, whereas Yl and 
(we think) Y5, y6, and Y7 have been adjusted. The change in definition 
involved a decrease of 250,000 in total number of employed. The overall 
effect of those errors is in the hundreds of thousands, perhaps millions, 

XI, Gross National Product Implicit Price Deflato r (in tenths). The 
implicit price deflator index is the ratio of GNP in current prices to GNP in 
constant prices. We have not been able to estimate its error. However, the 
publication "U.3, Income and Output, Supplement to Survey of Current Business' 

(1958, p, 52) notes: 

,,,we called attention to the shortcomings of price inflation. 
These stem from lack of price information directly applicable 
to many components of the current-dollar product flow; from the 
fact that, ^-enerall, rpeaking, available price information cannot 
take adequate account of premiums, discounts, and bargain sales; 
and from the even more basic problems encountered in pricing 
items subject to significant quality change, or whose physical 
units are not nearly definable for other reasons. 

X2, Gross National Product (in millions ) > The Gross National Product 
is also subject to many kinds of error. Actually, GNP can be computed from 
either the nation's input or output, and both calculations should result in 
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identical figures. In practice, the estimates do not, and over the years 
from 19^7 '^'O 1962 the discrepancy between the two measures rang^-^s from +5*5 
to -3.0 billion dollars. 

X3* Unemployment, X^. Size of Armed Forcfes, X^. Noninstitutional 
Populat i on (in t hous ands ) * These variables come from the same table (USDIjB) 
and thus have bhe same properties. The data from 19^7 to 1950 have been 
adjusted to reflect changes in the definition of employment and unemployment. ( 
These variables are not comparable over time because of the introduction of data 
from the 1950 census and the introduction of Alaska and Hawaii. 

X6, Year . The yea^' is perhaps the most difficult of these variables to 
understand. It is a catchall variable, and it is difficult to describe j\ist 
what it purports to measure. Gross National Product is the sum of a number 
of things over a year, whereas the population figures (XJ^ X^f, and X5) are 
values that fluctuate during the year and perhaps can be considered average 
values. To what does year refer? Is it a calendar year or fiscal year? Are 
the different employment figures collected at the same point in time? We do 
not know. 

All in all, one has the feeling that these data are good to a little 
over two significant digits' except for X6. One also has the feeling that 
the various government agencies have gone to great pains to make the data 
as accurate as possible. Of course, there are other type^ of error that we 
cannot estimate here. Although these data may be sufficiently accurate for 
some purposes, they clearly are not sufficiently accurate to "support" the 
unperturbed solution. 
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The Model 

The selection of a mathematical model is a very important part of a 
repress analysis. Ideally, a research person specifies a model, then 
collects data to estimate parameters, choosing the independent variables 
carefully to minimize the inter independent variable correlations and avoid- 
ing the problem of multicollinearity • But persons working with observational 
data often cannot avoid high correlations* Th\is, if the research person is 
really interested in performing linear regression using all his variables, 
the results of Section 3 indicate that he may have to collect his data with 
extreme precision to generate reasonable estimates of regression coefficients. 
Such precise measurement is seldon possible. 

On the other hand, many studies of observational data do not have a 
strong causal basis which dictates a model; in fact, many persons use the 
data and a stepwise regression px^ogram to construct a model. The question to 
v/hich we address ourselves here is: Would a different model have been as 
sensitive to such minor errors in the data? 

As indicated above, v/e doubt: that many statisticians would approve of 
a regression analysis Including such highly correlated independent variables. 
We have therefore decided ''o try a different model 



which deletes \ariables such thai no X 's are correlated higher than •95* 

The highest correlation among Lhese variables is .62 between and X, • 

1 5 

We first fit this regression model without perturbation using the DORTHO 

routine. The results are shown in Table 5 which also contains the results 

\ k 
foiVthe same thousand sets of perturbed data. 
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Insert Table 5 about here 

First, the perturbed results always agreed with the \juiperturbed in sign 
and in at least the first significant digit. The lowest and highest va.lues 
are not far enough apart to change interpretation. The t -statistics 
indicate that it is not unreasonable to presume that the solutions from the 
DORTHO routine would be the average of all perturbed solutions. The P-lim 
row indicates that, in contrast to the six-variable problem, the average 
perturbed solution in large samples would be very close to the unperturbed 
solution. 

V/e conclude, then, that this model is not affectv.d rauch by minor perturba- 
tions of these data. In fact, as we shall see shortly, even very poor programs 
compute good solutions to this model. 

6. The Program 

The question of computational accuracy involves both algorithm and 
precision and both must be related to cost. Longley's experiment indicates 
that different programs do Indeed yield different results, and thus one might 
assume that a research person is obliged to use the numerically best program 
for all problems. Our experiment shovrs that the ei'fects of minor errors in 
the data are greater than differences due to prograin; in fact, many of the 
coefficients estimated from the poor programs are closer to the mean of the 

perturbed solutions than the unperturbed solution. But are we usually in trouble 
using less stable programs? 

Many numerical analysts prefer the modified Gram-Schmidt method for 
regression analysis. I'he algorithm avoids computing a cross-products 
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matrix, thus does not "square" Its problems; that is, the condition number of 
the data matrix is the square root of the condition number of the raw cross- 
products matrix, thus almost certainly better conditioned. Most packaged 
regression programs do compute a cross-products matrix and solve the normal 
equations using a matrix inversion subroutine. All the programs in Table 3 
that disagreed (and some of those that agreed) with the unperturbed solution 
tried to solve the normal equations • 

V/hy do programmers fail to use the modified Gram-Schmidt algorithm? 
Basically, modified Gram-Schmidt is too expensive for a general program: it 
requires a pass over the data for each independent variable. The ORTHO 
routine is more efficient, requiring only two passes. If all data can be 
kept in computer memory, then the extra multiplications to calculate regression 
coefficients required by Gram-Schmidt can be justified, especially if residuals 
are to be calculated, but, if the data matrix exceeds computer storage, then 
the nimiber of logical revrinds — whether on tape or disk--will ordinarily 
discourage programmers and users. 

To avoid rewinds, one might use less preferred methods with double pre- 
cision. The Longley paper seems to indicate that double precision does not 
help since the IBM doiiblu precision solution was only trivially better than 
single precision. Hov;ever, this finding is not general but due instead to 
a subtle bug in the IBM program."^ 

To investigate the effects of algorithm and accuracy, we have programmed 
two subroutines, DLSSQ (Double precision Lea£t Squares) and V/RYLG (the Worst 
Routine You are ijikely to Get) in addition to DORTHO. The DLSSQ is not a 
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"good" prograjn but is about what one might expect in packaged regression 
programs. The WRYLG is the single precision version of DLSSQ and is about as 
bad a program as one might come upon, excluding those programs with real bugs. 
Bugs in programs can usually be caught by methods such as those suggested by 
Longley (I967) and by Mullet and Murray (1971). 

Both DLSSQ and WRYLG use fairly standard computing procedures. The 
cross-products are computed without subtracting means. The cross-products 
ZX.X. are centered by the notorious algorithm 



where N is the number of observations and j and k index any pair of 
variables. The inverse is computed by the Gauss-Jordan method. Kvoting is 
done in order without reordering by size. DLSSQ requires that a pivot be 
greater than .01 of the original variance; WRYLG reqxiires only that a pivot 
be positive, no matter how. small. Aside from this, the only difference 
between these routines is that DLSSQ is in double precision on the IBM 560. 

V/e submitted the unperturbed Longley data both of these subprograms. 
The DLSSQ routine agrf-ed quite closely v;ith the desk calculator solution as 
shown in Table 5. The WRYLG is not shown in Table ? since it could not com- 
pute a solution because of a negative pivot. In fact, the single precision 
representation of the augmented cross-products matrix has a negative eigen- 
value and determinant. 



The 1^000 sets of perturbed data v/ere submitted to both DLSSQ and V^RYLG. 
The DLSSQ and DORTliO programs are compared in Table 6. There is an average 



1 0 



N 




V 
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Insert Table 6 about here 

of over 7.5 significant digits of agreement for all coefficients. In the 
worst case the Uro programs agree to digits. Apparently, the perturba- 

tions affect both programs similarly. 

A svaranary of the 1,000 solutions by WRYLG are shown in Table 7- WRYLG 

Insert Table 7 about here 

rejected the problem because of a nonpositive pivot in fully 65^ of the 
samples. The analyses that went to completion tend to vary more than those 
from DORTHO. As v^ith the DORTHO routine, the coefficient of agreed with 

the unperturbed solution to one significant digit in about 95^ of the cases 
in which a solution was produced. V/e might view this program as poorer because 
of the increased variability in results; however, we might also consider WRYLG 
superior because it indicated that the problem has no solution in 6^io of the 
samples . 

The six-variable problem is one in which we know that small errors have 
major effects, but is the VJRYLG adequate for the three-variable problem? 
DL3SQ and DORTKO agree to an average of over 15 digits for the 1,000 data 
sets where three variables are used. A comparison of the WRYLG and DORTHO 

is shown in the bottom of Table 5. V7RYLG and DORTHO agree to an average of 
5.5 to h significant digits,^ even though the t -statistic indicates that 
the average difference is significantly different from zero. The means 'and 
standard deviations of the differences, and the largest and smallest differ- 
ences, are also shovrn. 
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We summarize, then, that the DLSSQ routine agrees substantially with the 
DORTHO routine in every comparison. Single precision routines including MKflG 
are more variable in the poorly conditioned six-variable problem. The WRYLG 
agrees with DORTHO to about 5.5 digits on the average in the three-variable 
problem . 

7. A Pertiirbation Index 

It clearly would be helpful if the standard output of regression programs 
included some indicator of the effect of perturbations on the calculated 
coefficients. A possible candidate is the list of standard errors or t -tests 
of the coefficients; these* however , do not indicate such instability but 
rather the instability of the coefficients over repeated sampling of new Y's 
for the same X*s. Thus the gross instability of the solution might surprise 
some researchers. 

The instabr.lity should not be surprising to numerical analysts«Wilkinson 
(1965, 1967' has developed a system of error analysis which places bounds on 
the effects of perturbation on calculated solutions due to the finite word 
length of computers. The Longley data are so ill-conditioned that the 
Wilkinson procedure does not give an error bound even with double precision 
on the IBM 36O. As Wilkinson says 

It v7ould be more reasonable bo ascribe the loss of accioracy in such 
a case to be apparent sensitivity of the equations rather than to 
speak of "severe accumulation of rounding errors." 
Our motivation here is to indicate how much the solution can vary due to 
perturbations beyond the last supplied digit even assuming an infinite word 
length computer. To do this we first consider the large sample limit of the 
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equally likely solutions ♦ This limit, called P-lim in the previous tables, 
can roughly be thought of as the, center of the distribution of the equally 
likely solutions assuming a large sample size. Since, in large samples, 
almost all perturbed solutions are extremely close to this central value, it 
is the sxoimnary of interest ♦ 

Consider the regression model 

y = T3 + G 

v;here y is an N x 1 vector of dependent variables, T is an N x m 
matrix of correct values of the m independent variables,* 3 is an m x 1 
vector of parameters we v/ish to estimate, and e is the l^l x 1 catchall 
vector for the unpredictable portion of y ♦ Letting X represent the 

actual observed data and the difference between X and T, we have 

8 

T = X + E . V/e assume that \.he expectation of eacn component of E is 
zero and that the individual errors are independent with known variances d. 



e(E) = 0 , I e (E»E) = D = 



0-° 



,,e(E'x) = 0 



If we knew the true values of the independent variables the usual least 
squares estimator would be 

3 = (T'T)"-^T'y = [(X + E)'(X + E)]"^(X + E)'y 

[X'X + E'X + X'E + E'E]"-^(X + E)'y ♦ 

As the sample size N gets larger (n -> oo) the values of quantities in the 

X 'X 

sample approach their population values; more formally, assuming -j^ - C^^ , 

X*v 

a fixed m x m matrix, and ^xy' ^ fixed m- vector for all N 
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P-lim p = [C + bV'^C 

XX xy 



Now the \asual least squares estimator 3^ based on the observed values is 



X XX xy 

The difference between and P-lim p is an indicator of the effect 

of the error perturbation on the calculated solution. This difference is 

B - P-lim p = c'-'-C - (C + D)'-'-C 
X XX xy ^ xy ^ xy 

= [C"^ - (C + D)"^]C 
XX ^ XX ' xy 

= [I - (I + c"^d)"^]g"^c 

XX ' XX xy 



= [I - (I + C-^D)-^]g . 



XX X 



If the matrix c;^D = 0^ , then = P-lim ^ . if c^l is "small" 
compared to I ,Lthen P-Um The phrase "small compared to I" 

really means that all of the eigenvalues of ^ C^D are close to zero. The 
largest eigenvalue of a matrix is commonly used to measure its size and is 
called the "spectral nom. " A simpler measure is the trace of C^^^D, which 
equals the sum of the eigenvalues which are nonnegative because c"-"- is 

XX 

positive and D is nonnegative. If the trace of Q-^B is substantially 
less than 1, we can be confident that = P-Hm p because the largest eigen- 
value must be substantially less than unity. Equivalently , if the "perturbation 

* 

index" 



PI = tr(c"-'-D) 

XX 
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is substantially less than 1, we can be confident that the effect of perturba- 
tion error on the compute^' lution will be small, especially for large N . 
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Alternatively, the perturbation index can be v/ritten 



m . . m 
PI = >: C^^d. = N ^ (X'X)^'^d. 

where C^^ and (X^X)^^ are the diagonal elements of c"-*- and (X'X)"-'-, 

XX XX ' 

respectively. The diagonal elements of C""^ can be written C'^^ = ^ — , 

XX XX 2 ^ 

(1 - R.)s. 
J J 

9 2 
where s is the marginal variance of X. and R. is the squar,ed multiple 

correlation between X^ and all the other X variables. Since D is a 

diagonal matrix with elements d. , the diagonal elements of C""^D are 

.1 XX 

d./s^ 

— — ^7^5 and hence the pertiirbation index can be written 
(1 - R?) 



J' • 2 

PI = z ^ 



m d./; 



j=l (1 - Rp 

/ 2 

The n"Ujnerator of each term of PI , d./s.. ^ is the ratio of the roxmding 

error variance to the marginal variance. The denominator of each term of 

2 

PI , 1 - R. y has been standard output for some time in some regression 

10 

routines as a measure of collinearity . The calculation of the order of 
magnitude of PI is trivial given an estimate of the original rounding error 



variances and either the diagonal of the inverse matrix, or the marginal 
ances and collinearity indices. 
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Since in our experiment the random numbers were rectangular, the value 

of d. was l/l2 for all variables except X for which the value was 1/1200 . 
J 

To compute the P-lim solution for the Longley problem the values of Nd. 

J 

were added to the diagonal of X'X before using the DLSSQ regression sub- 
program. The results are sliovm in Tables ^a and 6. Although the P-lim solu- 
tion is much closer to the center of the distribution than the unperturbed 
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solution, it 3s still significantly different from the mean of the perturbed 

solutions in all coefficients; perhaps this is not surprising because N = l6 

is rather small for our assumptions. 

The values of (X'X)'^'^ and d. , and the collinearity indices for both 

J 

the three- and six-variable problems are shown in Table 8, The perturbation 

-5 

indices are 2.86 x 10 and 2,9777 respectively. 

We note in the three-variable problem that none of the individual 

-If. 

elements of the perturbation index are larger than 10 , v/hich Is very small. 
In the unrtable six-variable problem, the component associated with variable 6 
is much greater than unity so that we would expect instability. V/e reran the 
problem excluding variable 6witTi the results that the perturbed and*the ordinary 

^ r 

least squares solutions agreed to at -least three decimal places and all the 
components of the perturbation index became small. 

Insert Table 8 about here 



Summary 

The purpose of this paper has been to show that regression coefficients 
can fluctuate wildly as a function of seemingly m.inor errors in the data as well 
as by choice of algorithm, precision of the computer program or the model 
-ndped, under the assumptions stated above, computations using very high 
aoT-uracy may result in a solution very unlikely to be close to the most likely 
solution. The results of these experiments are summarized In Table 9. 

[nsert Table 9 about here 

The first line in the table compares J;he ref^ression solution with three 
variables to the solution v;ith six variables. The marked fM'foct, of th*^ 

/ 

ERIC 



-23- 



r 



additiorf"of highly correlated independent variables should be no surprise 
to statisticians, in none of the seven coefficients v/as there a single 
significant digit of agreement. Two coefficients, and , differed 

by orders of magnitude. Since the three -variable model assumes that the 
coefficients B^, , B^ , and Bg are identically zero, the number of digits 
of agreements for these is exactly zero. 

Then the thousand perturbed solutions using DORTHO were compared with 
the unperturbed solutions for both the three- and six-variable problems. 
The perturbations affected the regression coefficients after the second 
significant digit for the three-variable problem. For the six-variable 
problem, no regression coefficient averaged a single significant digit of 
agreement with the unperturbed solution. Three coefficients averaged differ- 
ences in orders of magnitu-le. The, effect of perturbation is, therefore, very 
serious in the highly colli near six-variable model. 

The next comparison is between the DORTHO algorithm and the DISSQ algorithm, 
both of v/hich are double precision. The same thousand perturbed sets of data 
v/ere given to bo'.h programs and the results compared. V/e note that the algorithms 
agreed to an average of over 15 significant digits for the three -variable problem 
and over 7 significant digits in the six-variable problem. V/e conclude, therefore, 
that the choice of algorilbjn is not important if sufficient precision is used. 

The last comparlsoi Is betv/een the double precision (DLSSQ) and single 
precision (V/RYLO) least squares algorithms. It is clear that precision must 
at some point affect a solution. For the three-variable problem, the effect 
of single precision versus double precision v;as, on the average, after the third 
place or approximately an order of magnitude less than the effect of perturbation. 
For the six-variable problem, 65O solutions were rejected because of negative 
pivots, and the remaining yj^ comparisons of sets of regression coefficients did 
not average a single significant digit of agreement; in fact, four coefficients 
were off by an order of magnitude. 



We cannot know what the "true" solution to the Longley problem is> but it 
is clear that the use of stable algorithms and high precision is not likely 
to yield a valid answer without more accurate data. If data are sufficiently 
accurate, then the additional labor of special algorithms may be worth the 
trouble, but we feel that in many cases the attempt at estimating regression 
coefficients in highly collinear problems cannot b^ justified statistically. 
We propose that the perturbation index discussed in section 7 "be used 
routinely to indicate the existence of severe instability in regression 
solutions . 
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Foot notes 

"Hiowever, moderate simple correlations do not guarantee good 

conditioning. 

2 

The uniform pseudo-random nmbers were produced by a double precision 

Tausworthe (1965) generator. 

5 

The ORTHO algorithm which Longley foTiad most satisfactory for fixed 
word-length computers was programmed to investigate the Longley data. Our 
subroutine is called DORTHO and is programmed in double precision for the 
IBM 560/6^. Although the algorithm is classical Gram-Schmidt in nature, it 
avoids numerical instability by a second-stage correction. ^DORTHO produces 
regression coefficients that agree with Longley' s hand calculated solution 
in every published place, vie have been very careful in coding this routine 
since we v/ish there to be no question about its accuracy, even though the 

following argXLT.ents calx for belief in only one or two significant digits. 

h 

The lines involving V/KYLG v/ill be discussed in the next section. 

5 

The IBM solution was computed with two programs (CORRE and MULTR) in 
the IBM Scientific Subroutine Package (SSP). These subroutines were designed 
for single precision (P^^-bit mantissa), but the comments in the program 
instruct the user that, to make the routine double precision, one need only 
make some arrays double precision. A statement using one of these arrays is 

R(JK) = r(JK) + D(J)^D(K) 

where R is a double precision matrix in which sums of squares and cross- 
products are accumulated and D(J) and D(k) are single precision data. 
Unfortunately, In IBM FORTRAN the product of two single precision numbers 
is a single precision number, thus FORTRAM compiles instructions to multiply 
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D(J) by D(K) , then tr\incates the product to single precision; since 
the sum of single precision and double precision addends is double precision, 
the program fills out the least significant part of D(j)^D(k) with zeros 
before adding. These sums of squares and products are, of course, but little 
better than single precision. This FORTRAN idiosyncracy could ha\re been 
avoided by the statement 

R(JK) = R(JK) + DBL(d(i))^D(j) 

and the resulting regression solution would have been considerably closer to 
Longley's . 

We have used the method of calculating significant digits suggested by 
Jordan (1968) and used by Wampler (l970). 

'^To compute the Wilkinson error bound the quantity 

2'^n-\(A) 

must be less than unity, wiiere t = 56 is the number of bits in the mantissa 
of a computer word, n = 7 is the number of equations being solved, and 
k(A) = 2.561 X 10"^^ is the spectral norm. For this problem the result is 
approximately 8.7. 

^Note that these assumptions correspond to the Berkson case (se^ Berkson, 

1950, or Cochran, I968) and are not the usual assiamptions of errors of 

measurement (see Lord & Novick, I968 or Cochran, I968) . 
9 

Actually this relation hoicks only if the vector of constant ones is 
included (i.e., if the regression plane is not forced through the origin) 

and does not hold for the diagonal corresponding to the constant variable, 
p 

More generally, s' should be interpreted as the raw sum of sq\iares f X 

and R. the cosine of the angle betv/een X. and the hyperplane spanned- 

0 ' ^ 

by the other independent variables. 
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■^^This index is called t. by Longley. We note that this index is 

J 

also subject to computational error but our numerical experiments indicate 
that such error is not important. 
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TABLK 5 

Four Variable .deduced Model 
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TABLE 8 
Perturbation Indices 

Diag[(X'X)"^] d. Diag(x'X)"-'-l6d^ 

Four Variable Reauced Model 
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